(57) Abstract

Methods and apparatus are disclosed for reducing
discontinuities between frames of sinusoidally modeled acoustic
waveforms, such as speech, which occur when sampling at
low frame rates. A Fast Fourier Transform-based overlap-add
technique (28) is applied to amplitude (A), frequency ($\omega$)
and phase ($\theta$) components of sinusoidal waves after
frame-to-frame sine wave matching has been performed (20). Matched
sine wave amplitudes (A) and frequencies ($\omega$) are linearly
interpolated (26) and a mid-point phase ($\bar{\theta}(M)$) is estimated
such that the mid-frame sine wave is best fit to the most
recent half-frame segments of the lagging and leading sine
waves. Synthetic mid-frame sine waves are generated (28)
using the interpolated amplitude and frequency and estimated
phase values. Synthesized acoustic waveforms of high quality
from original source waveforms can be produced in sinusoidal
analysis/synthesis operations at coding frame rates of 50
Hz and lower.







[Figure: block diagram; legible labels: FFT-based overlap-add, output waveform]



BNSDOCID: <WO 8909985A1>






WO 89/09985 



PCT/US89/01378 



COMPUTATIONALLY EFFICIENT SINE WAVE SYNTHESIS 
FOR ACOUSTIC WAVEFORM PROCESSING 

The U.S. Government has rights in this
invention pursuant to the Department of the Air Force
Contract No. F19-028-85-C-0002.

Reference to Related Application 

This application is a continuation-in-part
of U.S. Serial No. 712,866, "Processing of Acoustic
Waveforms," filed March 18, 1985, incorporated herein
by reference.

Background of the Invention 

The field of this invention is speech 
technology generally and, in particular, methods and 
devices for analyzing, digitally encoding and 
synthesizing speech or other acoustic waveforms. 

Systems for digital encoding and synthesis 
of speech are the subject of considerable present 
interest, particularly at rates compatible with 
existing transmission lines, which commonly carry 
digital information at 2.4 - 9.6 kilobits per 
second. At such rates, conventional systems based 
upon speech waveform modeling are inadequate for 
coding applications and yield poor quality speech 
transmission, even if linear predictive coding (LPC) 
and other efficient coding techniques are used.

Typically, the problem of representing
speech signals is approached by using a speech
production model in which speech is viewed as the
result of passing a glottal excitation waveform
through a time-varying, linear filter that models the
resonant characteristics of the vocal tract. In a
so-called "binary excitation model," it is assumed
that the glottal excitation can be in one of two
possible states corresponding to voiced or unvoiced
speech.

In the voiced speech state, the excitation 
is periodic with a period which is allowed to vary 
slowly over time relative to the analysis frame rate, 
typically 10-20 msecs. For the unvoiced speech 
state, the glottal excitation is modeled as random 
noise with a flat spectrum. In both cases, the power 
level in the excitation is also considered to be 
slowly time-varying. 

While this binary model has been used 
successfully to design narrowband vocoders and speech 
synthesis systems, its limitations are well known. 
For example, the speech excitation is often mixed, 
having both voiced and unvoiced components 
simultaneously, and often only portions of the 
spectrum are truly harmonic. Additionally, the 
binary model requires that each frame of data be 
classified as either voiced or unvoiced, a decision 
which is difficult to make if the speech is subject 
to additive acoustic noise.

The above-referenced parent application,
U.S. Serial No. 712,866, discloses an alternative to
the binary excitation model in which speech analysis
and synthesis, as well as coding, can be accomplished
simply and effectively by employing a time-frequency
representation of the speech waveform which is
independent of the speech state. In particular, a
sinusoidal model for the speech waveform is utilized
to develop a new analysis and synthesis method.

The basic method of U.S. Serial No. 712,866 
includes the steps of (i) selecting frames (i.e.,
windows of approximately 20 to 60 milliseconds) of
samples from the waveform; (ii) analyzing each frame
of samples to extract a set of frequency components; 
(iii) tracking the components from one frame to the 
next; and (iv) interpolating the values of the 
components from one frame to the next to obtain a 
parametric representation of the waveform. A 
synthetic waveform can then be constructed by 
generating a set of sine waves corresponding to the 
parametric representation. The disclosures of U.S. 
Serial No. 712,866 are incorporated herein by 
reference. 

In one illustrated embodiment described in 
detail in U.S. Serial No. 712,866, the basic method 
is utilized to select amplitudes, frequencies and 
phases corresponding to the largest peaks in a 
periodogram of the measured signal, independently of 
the speech state. In order to reconstruct the speech 
waveform, the amplitudes, frequencies and phases of 
the sine waves estimated on one frame are matched and 
allowed to continuously evolve into the corresponding 
parameter set on the next frame.

Because the number of estimated peaks is not
constant and is slowly varying, the matching process
is not straightforward. Rapidly varying regions of
speech, such as unvoiced/voiced transitions, can
result in large changes in both the location and
number of peaks.
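
The peak-picking step described above can be sketched as follows. This is a minimal illustration, not the parent application's procedure: the Hamming window, the naive DFT and the "larger than both neighbours" peak rule are assumptions, and the name `pick_peaks` and its parameters are hypothetical.

```python
import cmath
import math


def pick_peaks(frame, max_peaks):
    """Return up to max_peaks (amplitude, frequency, phase) triples.

    frequency is in radians per sample. The window and the naive DFT are
    illustrative choices only (assumptions of this sketch).
    """
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    window_gain = sum(window)
    half = n // 2
    # Naive DFT of the windowed frame for bins 0..n/2.
    spectrum = [
        sum(frame[i] * window[i] * cmath.exp(-2j * math.pi * k * i / n)
            for i in range(n))
        for k in range(half + 1)
    ]
    mags = [abs(x) for x in spectrum]
    # A peak is a bin whose magnitude exceeds both neighbours.
    peaks = [k for k in range(1, half)
             if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]]
    # Keep the largest peaks, then report them in order of frequency.
    peaks.sort(key=lambda k: mags[k], reverse=True)
    result = []
    for k in sorted(peaks[:max_peaks]):
        amplitude = 2.0 * mags[k] / window_gain
        result.append((amplitude, 2 * math.pi * k / n, cmath.phase(spectrum[k])))
    return result
```

Because the number of local maxima changes from frame to frame, the returned list can be shorter than `max_peaks`, which is the situation the matching step has to handle.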

To account for such rapid movements in
spectral energy, the concept of "birth" and "death"
of sinusoidal components is employed in a
nearest-neighbor matching method based on the
frequencies estimated on each frame. If a new peak
appears, a "birth" is said to occur and a new track
is initiated. If an old peak is not matched, a
"death" is said to occur and the corresponding track
is allowed to decay to zero.
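
A simplified sketch of nearest-neighbor matching with births and deaths is given here. The greedy closest-pair rule and the `max_delta` matching interval are assumptions made for illustration; the exact rule of U.S. Serial No. 712,866 is not reproduced in this text, and `match_tracks` is a hypothetical name.

```python
def match_tracks(prev_freqs, next_freqs, max_delta):
    """Greedy nearest-neighbour matching of sine wave frequencies.

    Returns (matches, deaths, births): matches is a list of
    (prev_index, next_index) pairs, deaths lists unmatched indices of
    prev_freqs, and births lists unmatched indices of next_freqs.
    """
    # All candidate pairs within the matching interval, closest first.
    candidates = sorted(
        (abs(fp - fn), i, j)
        for i, fp in enumerate(prev_freqs)
        for j, fn in enumerate(next_freqs)
        if abs(fp - fn) <= max_delta
    )
    used_prev, used_next, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_next:
            matches.append((i, j))
            used_prev.add(i)
            used_next.add(j)
    deaths = [i for i in range(len(prev_freqs)) if i not in used_prev]
    births = [j for j in range(len(next_freqs)) if j not in used_next]
    return sorted(matches), deaths, births
```

A track listed in `deaths` would be faded to zero amplitude over the frame, and a track listed in `births` would be faded in from zero.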

Once the parameters on successive frames 
have been matched, phase continuity of each 
sinusoidal component is ensured by unwrapping the 
phase. In one embodiment described in U.S. Serial 
No. 712,866, the phase is unwrapped using a cubic 
phase interpolation function having parameter values 
that are chosen to satisfy the measured phase and 
frequency constraints at the frame boundaries while 
maintaining maximal smoothness over the frame 
duration. 
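
The cubic phase function can be sketched as below. The closed form used is the commonly published solution for a cubic that meets the phase and frequency constraints at both frame boundaries; it is offered as an assumption about what the parent application describes, since its equations are not reproduced in this text. The name `cubic_phase` is hypothetical.

```python
import math


def cubic_phase(theta0, omega0, theta1, omega1, frame_len):
    """Return phase(n) for 0 <= n <= frame_len (frequencies in rad/sample).

    phase(n) = theta0 + omega0*n + alpha*n**2 + beta*n**3, with
    phase(frame_len) equal to theta1 plus a multiple of 2*pi and a slope
    of omega1 at n = frame_len.
    """
    t = float(frame_len)
    # Multiple of 2*pi that gives the smoothest track (assumed criterion).
    x = ((theta0 + omega0 * t - theta1) + (omega1 - omega0) * t / 2.0) / (2.0 * math.pi)
    m = round(x)
    d = theta1 + 2.0 * math.pi * m - theta0 - omega0 * t   # residual phase to absorb
    w = omega1 - omega0                                     # frequency change
    alpha = 3.0 * d / t ** 2 - w / t
    beta = -2.0 * d / t ** 3 + w / t ** 2
    return lambda n: theta0 + omega0 * n + alpha * n ** 2 + beta * n ** 3
```

Evaluating such a function for every sample of every sine wave is the per-sample cost that the FFT-based approach discussed later is meant to avoid.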

In the final step of the illustrated 
embodiment, the corresponding sinusoidal amplitudes 
are interpolated in a linear manner across each frame.

In speech coding applications, U.S. Serial
No. 712,866 teaches that pitch estimates can be used
to establish a set of harmonic frequency bins to
which frequency components are assigned. The term
"pitch" is used herein to denote the fundamental rate
at which a speaker's vocal cords are vibrating. The
amplitudes of the components are coded directly using
adaptive differential pulse code modulation (ADPCM)
across frequency, or indirectly using linear
predictive coding (LPC).

In one embodiment of the coder, the peak in 
each harmonic frequency bin having the largest 
amplitude is selected and assigned to the frequency 
at the center of the bin. This results in a harmonic 
series based upon the coded pitch period. An 
amplitude envelope can then be constructed by 
connecting the resulting set of peaks and later 
sampled in a pitch-adaptive fashion (either linearly 
or non-linearly) to provide efficient coding at 
various bit rates. The phases can then be coded by 
measuring the phases of the edited peaks and then 
coding such phases using 4 to 5 bits per phase peak. 
Further details on coding acoustic waveforms in 
accordance with applicants' sinusoidal analysis 
techniques can be found in commonly-owned, copending 
U.S. Patent Application Serial No. 034,097, entitled 
"Coding of Acoustic Waveforms," incorporated herein 
by reference. 
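
The harmonic-bin selection described above can be sketched as follows. The bin boundaries (half a pitch interval on either side of each harmonic) and the name `assign_to_harmonics` are assumptions of this sketch; the coder of U.S. Serial No. 034,097 may define its bins differently.

```python
def assign_to_harmonics(peaks, pitch_hz, max_freq_hz):
    """Map (amplitude, frequency_hz) peaks onto a harmonic series.

    Bin k is centred on k * pitch_hz. The largest-amplitude peak in each
    bin is kept and moved to the bin centre. Returns a list of
    (harmonic_frequency_hz, amplitude) pairs in order of frequency.
    """
    n_bins = int(max_freq_hz // pitch_hz)
    best = {}
    for amp, freq in peaks:
        k = int(freq / pitch_hz + 0.5)          # nearest harmonic number
        if 1 <= k <= n_bins and amp > best.get(k, 0.0):
            best[k] = amp
    return [(k * pitch_hz, best[k]) for k in sorted(best)]
```

For example, with a 100 Hz pitch the peaks at 98 Hz and 104 Hz compete for the first bin, and only the larger one survives at exactly 100 Hz.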

Analysis/synthesis systems constructed
according to the invention disclosed in U.S. Serial
No. 712,866, based on a sinusoidal representation of
speech, yield synthetic speech that is essentially
indistinguishable from the original. Coding
techniques as disclosed in U.S. Serial No. 034,097
have led to the realization of multi-rate coders
operating at rates from 2.4 to 9.6 kilobits per
second. Such systems produce synthetic speech that
is very intelligible at all rates and, in general,
produce speech having progressively improving quality
as the data rate is increased.

A practical limitation of the sinusoidal 
technique has been the computational complexity 
required to perform the sinusoidal synthesis. This 
complexity results because it is typically necessary 
to generate each sine wave on a per-sample basis and 
then sum the resulting set of sine waves. Good 
performance can be achieved in sinusoidal 
analysis/synthesis while operating at a 50 Hz frame 
rate, provided that the sine wave frequencies are 
matched from frame to frame and that either cubic 
phase or piece-wise quadratic phase interpolators are 
used to ensure consistency between the measured 
frequencies and phases at the frame boundaries. The 
disadvantage of this approach is the computational 
overhead associated with the interpolation process. 
Even if very powerful 125 nanosecond/cycle 
microprocessors are utilized, such as the ADSP2100 
DSP integrated circuits manufactured by Analog 
Devices (Norwood, MA), two such microprocessors 
typically are required to synthesize 80 sine waves.

An alternative method for performing
sinusoidal synthesis includes constructing a set of
sine waves having constant amplitudes, frequencies
and linearly-varying phases, applying a triangular
window of twice the frame size, and then utilizing an
overlap-and-add technique in conjunction with the
sine waves generated on the previous frame. Such a
set of sine waves can also be generated using
conventional Fast Fourier Transform (FFT) methods.
In this approach, a Fast Fourier Transform (FFT)
buffer is filled out with non-zero entries at the
sine wave frequencies, an inverse FFT is executed,
and then the overlap-and-add technique is applied.
This process also leads to synthetic speech that is
perceptually indistinguishable from the original,
provided the frame rate is approximately 100 Hz
(10 ms/frame).

However, for low-rate coding applications,
it is necessary to operate at a 50 Hz frame rate
(20 ms/frame) or lower. At these frame rates, the FFT
overlap-and-add method yields synthetic speech that
sounds "rough" because the triangular parametric
window is at least 40 ms wide, and this is too long a
period compared to the rate of change of the vocal
tract and vocal cord articulators.

An apparatus for computationally efficient 
coding of acoustic waveforms at frame rates of 50 Hz 
or less, without the "roughness" produced at low 
coding rates by the above-described methods, would 
meet a substantial need. In particular, speech 
processing devices and methods that reduce 
frame-to-frame discontinuities at low coding rates 
would be particularly advantageous for coding of 
speech.

Accordingly, there exists a need for
computationally efficient methods and devices for
synthesizing sine waves for speech coding, analysis
and synthesis systems which operate at low coding
rates requiring frame rates of 50 Hz and below. In
particular, techniques and apparatus for efficient
synthesis of sine waves in connection with sinusoidal
transform coding would satisfy long-felt needs and
provide substantial contributions to the art.

Summary of the Invention

Sine wave synthesis and coding systems are 
further disclosed for processing acoustic waveforms 
based on Fast Fourier Transform (FFT) overlap-and-add 
techniques. A technique for sine wave synthesis is 
disclosed which relieves computational choke points 
by generating mid-frame sine wave parameters, thereby 
reducing frame-to-frame discontinuities, particularly 
at low coding rates. The technique is applied to the 
sinusoidal model after the frame-to-frame sine wave 
matching has been performed. Mid-frame values are 
obtained by linearly interpolating the matched sine 
wave amplitudes and frequencies and estimating a 
mid-point phase, such that the mid-frame sine wave is 
best fit to the most recent half-frame segments of 
the lagging and leading sine waves. 

For example, the invention provides methods
and apparatus for receiving sets of sine wave
parameters every 20 ms and for implementing an
interpolation technique that allows for resynthesis
every 10 ms.

In synthesizing the mid-frame sine wave
components, the mid-frame phase can be estimated as
follows:

$$\bar{\theta}(M) = \frac{\theta_0 + \theta_1}{2} + \frac{\omega_0 - \omega_1}{2} \cdot \frac{N}{4} + \pi M$$

where M is an integer whose value is chosen such that
$\pi M$ is closest to

$$\frac{\theta_0 - \theta_1}{2} + \frac{\omega_0 + \omega_1}{2} \cdot \frac{N}{4}$$

and where $\theta_0$ is the phase of the lagging frame,
$\theta_1$ is the phase of the leading frame, $\omega_0$ is
the frequency of the lagging frame, $\omega_1$ is the
frequency of the leading frame, and N is the analysis
frame length.

In another aspect of the invention, a system 
is disclosed which provides improved quality, 
particularly for low-rate speech coding applications 
where the speech has been corrupted by additive 
acoustic noise. For high pitched speakers
especially, background noise can have a tonal quality
when resynthesized that can be annoying if the
signal-to-noise ratio (SNR) is low. When a
pitch-adaptive analysis window is used, the window 
will be short for high pitched speakers and, when 
applied to the noise, will result in relatively few 
resolved sine waves. The resulting synthetic noise 
then sounds tonal. In addition to reducing the 
frame-to-frame discontinuities, the present invention 
suppresses this tonal noise and replaces it with a 
more "noise-like" signal which improves the 
robustness of the system.

In one embodiment of the noise compensating
system, the receiver can employ a voicing measure to
determine highly unvoiced frames (i.e., noisy
frames), and the spectra for successive noisy frames
can then be averaged to obtain an average background
noise spectrum. This information can be used to
suppress the synthesized noise at the harmonics in
accordance with the SNR at each harmonic and used to
replace the suppressed noise with a broad band noise
having the same spectral characteristic.

Methods are also disclosed for phase 
regeneration of sine waves for which no phase coding 
is possible. At low data rates (e.g., 2.4 kbps and 
below), it is typically not possible to code any of 
the sine wave phases. Thus, in another aspect of the 
invention, techniques are disclosed to reconstruct an 
appropriate set of phases for use in synthesis, based 
on an assumption that all the sine waves should come 
into phase every pitch onset time. Reconstruction is 
achieved by defining a phase function for the pitch 
fundamental obtained by integration of the 
instantaneous pitch frequency. 

The invention will next be described in 
connection with certain illustrated embodiments. 
However, it should be clear that various changes and 
modifications can be made by those skilled in the art 
without departing from the spirit and scope of the 
invention, as defined by the claims. For example, 
although the description that follows is particularly 
adapted to speech coding, it should be clear that 
various other acoustic waveforms can be processed in 
a similar fashion.

Brief Description of the Drawings

For a more thorough understanding of the 
nature and objects of the invention, reference should 
be had to the following detailed description and to 
the drawings, in which: 

FIG. 1 is an illustration of a simple 
overlap-and-add interpolation technique in accordance 
with the invention, showing a triangular parametric 
window applied to sine wave parameters obtained at 
frame boundaries to generate interpolated values 
between those measured at frame boundaries; 

FIG. 2 is an illustration of a further 
application of overlap-and-add interpolation 
techniques according to the invention, showing the 
generation of an artificial mid-frame sine wave to 
reduce the discontinuities in the resynthesized 
waveform at low coding rates; 

FIG. 3 is a flow chart showing the steps of 
a method of mid-frame sine wave synthesis according 
to the invention; 

FIG. 4 is a schematic block diagram of a 
mid-frame sine wave synthesis system according to the 
invention; and 

FIG. 5 is a further schematic block diagram 
showing a noise suppressing receiver structure 
according to the invention.

Detailed Description

In the present invention the speech waveform
is modeled as a sum of sine waves. If s(n)
represents the sampled speech waveform, then

$$s(n) = \sum_i A_i(n) \cos[\theta_i(n)] \tag{1}$$

where $A_i(n)$ and $\theta_i(n)$ are the time-varying
amplitudes and phases of the i'th tone.

To obtain a representation of the waveform 
over time, frequency components measured on one 
analysis frame must be matched with frequency 
components that are obtained on a successive frame. 
In particular, a frequency component from one frame 
must be matched with a frequency component in the 
next frame having the "closest" value. The matching 
technique is described in more detail in parent case 
U.S. Serial No. 712,866, herein incorporated by 
reference. Once matched, the values of the 
components from one frame to the next must be 
interpolated to obtain a parametric representation in 
which the sine waves of one frame evolve into the 
corresponding parameter set of the next frame.

FIG. 1 illustrates the basic process of
interpolating exemplary frequency components for
frames K and K+1 in accordance with the invention by
the overlap-and-add method. The triangular windows A
and B shown in FIG. 1 are used to interpolate the
sine wave components from frame K to frame K+1. In
the overlap-and-add method of filling in data values,
the triangular window is applied to the resulting
sine waves generated during each frame. The
overlapped values in region C are then summed to fill
in the values between those measured at the frame
boundaries.
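
The triangular overlap-and-add interpolation of FIG. 1 can be sketched in the time domain as follows. This is an illustration under stated assumptions: each frame's parameters are taken to be measured at the frame boundary instant, `hop` is the spacing between boundaries in samples, and the name `overlap_add` is hypothetical.

```python
import math


def overlap_add(frames, hop):
    """Synthesize samples from per-frame lists of (amplitude, omega, phase).

    Frame k's parameters apply at sample k * hop; omega is in radians per
    sample. Each frame's sine waves keep constant amplitude and frequency,
    are weighted by a triangle of total width 2 * hop, and are summed.
    """
    out = [0.0] * ((len(frames) - 1) * hop + 1)
    for k, sines in enumerate(frames):
        center = k * hop
        for m in range(-hop + 1, hop):
            n = center + m
            if 0 <= n < len(out):
                weight = 1.0 - abs(m) / hop      # triangular window
                out[n] += weight * sum(a * math.cos(w * m + p) for a, w, p in sines)
    return out
```

Adjacent triangles sum to one, so a sine wave whose parameters are consistent from frame to frame is reproduced exactly. Discontinuities appear only when the frames disagree, and they are spread over a window that is two frame intervals wide.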

The overlap/add technique illustrated in
FIG. 1 yields good performance for frame rates
near 100 Hz, i.e. 10 ms frames. However, for most
coding applications, frame rates of approximately
50 Hz, i.e. 20 ms frames, are required. When the
overlap-and-add interpolation technique shown in FIG.
1 is used in this case, the triangular window is
effectively 40 ms wide, which assumes a stationarity
that is too long relative to the rate of change of
the human vocal tract and vocal cord articulators,
and significant frame-to-frame discontinuities
result. Thus, a further preferred embodiment of the
invention provides a method for minimizing such
discontinuities.

If $A_0$, $\omega_0$, and $\theta_0$ represent the
amplitude, frequency and phase of a sine wave on
frame K and $A_1$, $\omega_1$, and $\theta_1$ represent the
amplitude, frequency and phase of the matched sine
wave on frame K+1, then the equations:

$$\bar{A} = \frac{A_0 + A_1}{2} \tag{2}$$

and

$$\bar{\omega} = \frac{\omega_0 + \omega_1}{2} \tag{3}$$

represent a good approximation of the true amplitude
and frequency at the mid-point between frame K and
frame K+1. Equations 2 and 3 represent one set of
interpolation functions which can be used to fill in
data values between those measured at frame
boundaries.

In order to minimize any discontinuity
between the sine wave at frame K and its transition
to the synthetic sine wave at the mid-point and
between the synthetic sine wave and its transition to
the sine wave at frame K+1, the invention calculates
a phase that yields the minimum mean-squared-error at
times N/4 and 3N/4, where N is the analysis frame
length. This phase is calculated according to the
equation:

$$\bar{\theta}(M) = \frac{\theta_0 + \theta_1}{2} + \frac{\omega_0 - \omega_1}{2} \cdot \frac{N}{4} + \pi M \tag{4}$$

where M is an integer whose value is chosen such
that $\pi M$ is closest to

$$\frac{\theta_0 - \theta_1}{2} + \frac{\omega_0 + \omega_1}{2} \cdot \frac{N}{4} \tag{5}$$

In accordance with this preferred embodiment
of the invention, an artificial set of mid-frame sine
waves is generated by applying the above
interpolation rules for all of the matched sine waves
and then applying a conventional FFT overlap-and-add
technique. FIG. 2 illustrates this overlap-and-add
interpolation technique, showing an artificial sine
wave between frame K and frame K+1. The artificial
sine wave $\bar{S}(n)$, generated with values provided by the
above interpolation rules, reduces the
discontinuities between $S_0(n)$ and $S_1(n)$ shown in
FIG. 2. Because the effective stationarity has been
reduced from 40 ms to 20 ms, the resulting synthetic
speech is no longer "rough." Hence, the invention
provides a method for doubling the effective
synthesis rate with no increase in the actual
transmission frame rate.
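
The interpolation rules of Equations 2 to 5 can be sketched for one matched pair of sine waves as follows. Algebraically, Equation 4 is the average of the lagging phase advanced by $\omega_0 N/4$ and the leading phase retarded by $\omega_1 N/4$, with $\pi M$ resolving the ambiguity between the two estimates. The function name `midframe_parameters` and the use of rounding to find M are illustrative choices, not taken from the text.

```python
import math


def midframe_parameters(a0, w0, th0, a1, w1, th1, n):
    """Apply Equations 2 to 5 to one matched pair of sine waves.

    w0 and w1 are in radians per sample; n is the analysis frame length N
    in samples. Returns (amplitude, frequency, phase) at the mid-frame.
    """
    amp = 0.5 * (a0 + a1)                                     # Equation 2
    freq = 0.5 * (w0 + w1)                                    # Equation 3
    target = 0.5 * (th0 - th1) + 0.5 * (w0 + w1) * n / 4.0    # Equation 5
    m = round(target / math.pi)     # integer M with pi*M closest to target
    phase = 0.5 * (th0 + th1) + 0.5 * (w0 - w1) * n / 4.0 + math.pi * m  # Equation 4
    return amp, freq, phase
```

When the two boundary phases agree, that is, when advancing the lagging phase and retarding the leading phase give the same value modulo $2\pi$, the mid-frame phase returned here coincides with that common value.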

In FIG. 3, a flow chart of the processing 
steps for interpolation using synthetic mid-frame 
parameters according to the invention is shown. Sine 
wave parameters for each frame are received and 
sampled every T ms, where T is the frame period for
frames K and K+1. The sine wave parameters include
amplitude A, frequency $\omega$ and phase $\theta$.

The frequency components for frames K and
K+1 are then matched, preferably according to the
method described in U.S. Serial No. 712,866, and a 
mid-frame sine wave is constructed having an 
amplitude and frequency given by Equations 2 and 3, 
and a phase is estimated for each sine wave 
component, in accordance with Equation 4 above, such 
that each mid-frame sine wave is best fit to the most 
recent half-frame segments of the lagging and leading 
sine waves.

In the final step, the overlap-and-add
technique is applied to interpolate between the frame
K and mid-frame values and, likewise, to interpolate
between the mid-frame and frame K+1 values in order
to synthesize a set of waveforms at a virtual frame
period of T/2 ms. Thus, the synthetic waveform reduces the
discontinuities between the frame K and frame K+1
waveforms, in effect generating an artificial frame
half the duration of the actual frame.
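
The complete flow of FIG. 3 can be sketched in the time domain as below. Two assumptions are made and should be read as such: the quantity N/4 in Equations 4 and 5 is taken to equal the number of samples from a frame boundary to the mid-frame (`hop // 2` here), which the text does not state explicitly, and a direct cosine evaluation stands in for the FFT-based generator. The names `midframe` and `synthesize_frame` are hypothetical.

```python
import math


def midframe(a0, w0, th0, a1, w1, th1, half):
    """Mid-frame (amplitude, omega, phase); half plays the role of N/4."""
    target = 0.5 * (th0 - th1) + 0.5 * (w0 + w1) * half
    m = round(target / math.pi)
    phase = 0.5 * (th0 + th1) + 0.5 * (w0 - w1) * half + math.pi * m
    return 0.5 * (a0 + a1), 0.5 * (w0 + w1), phase


def synthesize_frame(lagging, leading, hop):
    """Overlap-add frame K, the mid-frame and frame K+1 over hop samples.

    lagging and leading are matched lists of (amplitude, omega, phase)
    and hop must be even. Each triangle spans hop samples in total,
    half the width needed without the mid-frame.
    """
    half = hop // 2
    middle = [midframe(*s0, *s1, half) for s0, s1 in zip(lagging, leading)]
    out = [0.0] * (hop + 1)
    for center, sines in ((0, lagging), (half, middle), (hop, leading)):
        for n in range(hop + 1):
            weight = max(0.0, 1.0 - abs(n - center) / half)
            if weight > 0.0:
                out[n] += weight * sum(a * math.cos(w * (n - center) + p)
                                       for a, w, p in sines)
    return out
```

With a 20 ms frame period, each triangle in this sketch covers 20 ms rather than 40 ms, which is the halving of the effective stationarity interval that the text describes.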

FIG. 4 is a block diagram of an acoustic
waveform processing apparatus according to the
invention. The transmitter 10 includes a sine wave
parameter estimator 12 which samples the input
acoustic waveform to obtain discrete samples and
generates a series of frames, each frame spanning a
plurality of samples. The estimator 12 further
includes means for extracting a set of frequency
components having discrete amplitudes and phases.
The amplitude, frequency and phase information
extracted from the sampled frames of the input
waveform is coded by coder 14 for transmission. The
sampling, analyzing and coding functions of elements
12 and 14 are more fully discussed in U.S. Serial No.
712,866, as well as U.S. Serial No. 034,097, also
incorporated herein by reference.

In the receiver section 16, the coded 
amplitude, frequency and phase information is decoded 
by decoder 18 and then analyzed by frequency tracker 
20 to match frequency components from one frame to 
the next.

The interpolator 22 interpolates the values
of components from one frame to the next frame to
obtain a parametric representation of the waveform,
so that a synthetic waveform can be synthesized by
generating a set of sine waves corresponding to the
interpolated values of the parametric representation.

In a preferred embodiment of the invention,
the interpolator 22 includes a mid-frame phase
estimator 24 which implements a "best fit" phase
calculation, in accordance with Equations 4 and 5
above, and a linear interpolator 26, which linearly
interpolates matched amplitude and frequency
components from one frame to the next frame. The
apparatus 10 further includes an FFT-based sine wave
generator 28 which performs an overlap-and-add
function utilizing Fourier analysis.

The generator 28 further includes means for 
filling a buffer with amplitude and phase values at 
the sine wave frequencies, means for taking an 
inverse FFT of the buffered values, and means for 
performing an overlap-and-add operation with 
transformed values and those obtained from the 
previous frame. 
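
The buffer-filling and inverse-transform steps of the generator can be sketched as follows. Rounding each sine wave frequency to the nearest FFT bin, the scaling of the buffer entries and the small recursive radix-2 transform are assumptions of this sketch; the names `inverse_fft` and `sines_via_fft` are hypothetical.

```python
import cmath
import math


def inverse_fft(spectrum):
    """Unnormalized recursive radix-2 inverse FFT; length must be a power of two."""
    n = len(spectrum)
    if n == 1:
        return list(spectrum)
    even = inverse_fft(spectrum[0::2])
    odd = inverse_fft(spectrum[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def sines_via_fft(sines, size):
    """Generate a sum of sine waves with a single inverse FFT of length size.

    sines is a list of (amplitude, omega, phase) with omega in radians per
    sample. Each frequency is rounded to the nearest bin (an approximation
    introduced by this sketch).
    """
    buffer = [0j] * size
    for amp, omega, phase in sines:
        k = round(omega * size / (2.0 * math.pi)) % size
        if k == 0 or 2 * k >= size:
            continue  # only bins strictly between DC and Nyquist are handled
        buffer[k] += 0.5 * amp * cmath.exp(1j * phase)
        buffer[size - k] += 0.5 * amp * cmath.exp(-1j * phase)  # conjugate mirror
    return [value.real for value in inverse_fft(buffer)]
```

The cost of the transform does not grow with the number of sine waves, which is the source of the computational saving over per-sample generation. The output block would then be windowed and overlap-added with the block from the previous frame.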

Moreover, as shown generally in FIG. 4, the
apparatus 10 can also optionally include a noise
estimator and generator 30. For high-pitched
speakers especially, the background noise has a tonal
quality that can become quite annoying, particularly
when the signal-to-noise ratio (SNR) is low. The
noise dependence on pitch is due to the fact that the
analysis window typically is set at two and one-half
times the average pitch period. Hence, for a high-pitched
speaker, the window will be short (but no less than
20 ms) which, when applied to the noise, results in
relatively few resolved sine waves. The resulting
synthetic noise then sounds tonal. Conversely, for
low-pitched speakers, the window will be quite long.
This results in a more resolved noise spectrum, which
leads to a larger number of sine waves for synthesis,
which, in turn, sounds more "noise-like," that is to
say, less tonal.

In FIG. 5, a noise correction system 30 
according to the invention is shown in more detail. 
The noise correction system 30 operates in concert 
with a speech (or other acoustic waveform) 
synthesizer 32 (e.g., frequency tracking, 
interpolating and sine wave generating circuitry as 
described above in connection with FIG. 4), and 
includes a noise envelope estimator 34, a noise 
suppression filter 36, a broadband noise generator 
38, and a summer 40. The noise envelope estimator 34 
estimates the noise envelope parameters from decoded 
sine waves and voicing measurements, as discussed in 
more detail below. These noise envelope parameters 
drive the noise suppression filter 36 to modify the 
waveforms from synthesizer 32 and also drive the 
broadband noise generator 38. The modified, 
synthetic waveforms and broadband noise are then 
added in summer 40 to obtain the output waveform in 
which "tonal" noise is essentially eliminated.

Although the noise correction system 30 is
illustrated by discrete elements, it should be
apparent that the functions of some or all of these
elements can be combined in operation. For example,
the noise correction system can be implemented as
part of the synthesizer itself, by applying noise
attenuation factors to the harmonic entries in an
FFT buffer during the synthesis operations, and the
broadband noise can be implemented by adding
predetermined randomizing factors to the amplitudes
and phases of all of the FFT buffer entries prior to
synthesis.

Since the system of the present invention is
essentially linear, the envelope of the speech plus
noise spectra and the envelope of the noise spectra
are correctly replicated at the receiver. Since the
coder also transmits a measure of the probability
that any given frame of speech is voiced, it is
possible to average those spectra for which strong
voicing is unlikely. This results in an estimate of
the envelope of the spectrum of the background
noise. A synthetic noise waveform can then be
generated by creating another FFT buffer with complex
entries at every frequency using random phases that
are uniformly distributed over $[0, 2\pi]$, and random
amplitudes that are uniformly distributed over
$[0, N(\omega)]$, where $N(\omega)$ is the value of the average
background noise envelope at each FFT frequency
point, $\omega$. This buffer can then be added to the
pitch-dependent FFT buffer.
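
The noise envelope averaging and the synthetic noise buffer can be sketched as follows. The voicing threshold of 0.2, the running-average form, the conjugate-symmetric layout (so that the inverse transform is real) and the zero DC and Nyquist entries are all assumptions of this sketch; `update_noise_envelope` and `noise_buffer` are hypothetical names.

```python
import cmath
import math
import random


def update_noise_envelope(average, envelope, voicing_prob, count, threshold=0.2):
    """Running average of spectral envelopes over frames judged unvoiced.

    Returns the updated (average, count). Frames whose voicing probability
    is at or above the threshold leave the average unchanged.
    """
    if voicing_prob >= threshold:
        return average, count
    count += 1
    average = [avg + (e - avg) / count for avg, e in zip(average, envelope)]
    return average, count


def noise_buffer(noise_envelope, rng):
    """Complex FFT buffer with random phases and amplitudes.

    noise_envelope holds N(omega) for bins 0..size/2. Phases are uniform
    over [0, 2*pi] and amplitudes are uniform over [0, N(omega)].
    """
    half = len(noise_envelope) - 1
    size = 2 * half
    buffer = [0j] * size
    for k in range(1, half):
        amp = rng.uniform(0.0, noise_envelope[k])
        phase = rng.uniform(0.0, 2.0 * math.pi)
        buffer[k] = amp * cmath.exp(1j * phase)
        buffer[size - k] = buffer[k].conjugate()   # mirror for a real output
    return buffer
```

A caller would pass a seeded generator such as `random.Random(0)` for repeatable output, and would add the returned buffer to the pitch-dependent buffer before the inverse transform.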

One objection to this straightforward
addition is the fact that the noise would already
have been replicated at the harmonic frequencies and,
in some sense, would have been duplicated in the
synthesis process. This problem can be avoided by
using a modest amount of noise suppression by any of
various techniques known to those skilled in the
art. For example, the SNR can be measured and the
gain attenuated by a function of the SNR, such that,
if the SNR is high, little attenuation is imposed,
while if the SNR is low, attenuation is increased.

Since the noise spectrum is known at the
receiver, the average background noise energy can be
computed. If this is denoted by

$$E_n = \int N(\omega)\, d\omega \tag{6}$$

and if

$$E_y = \int Y(\omega)\, d\omega \tag{7}$$

denotes the total energy in the envelope of the
speech plus noise on any given frame, then the SNR
can be calculated using

$$\mathrm{SNR} = \frac{E_y - E_n}{E_n} \tag{8}$$

The output signal level can then be modified
according to the rule

$$Y'(\omega) = Y(\omega)\, G(\omega) \tag{9}$$

where the gain $G(\omega)$ at frequency $\omega$ is given by
the simple noise-suppression characteristic

$$\log[G(\omega)] = \begin{cases} a\,[\log(\mathrm{SNR}) - \log(\mathrm{SNR}_0)] & \text{for } \mathrm{SNR} < \mathrm{SNR}_0 \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

where the transition at $\log(\mathrm{SNR}_0)$ is chosen to
correspond to about a 3 dB SNR and the slope, a, is
chosen according to the degree of noise suppression
desired. (Usually only a modest slope, $a \approx 1$, is
used.) This gain is applied to the amplitudes at
the pitch harmonics, and the signal level is
suppressed depending on the amount the SNR is below
the 3 dB level. Therefore, if speech is absent on
any given frame, the amplitude entries for the
harmonic noise will be suppressed, and when the
resulting buffer is added to the synthetic noise
buffer, the final contribution to the synthesized
noise will be given mainly by the average background
noise envelope. On the other hand, if speech is
present that exceeds the 3 dB level, it is
synthesized at the measured level and then added to
the synthetic noise. Since this noise will always be
at least 3 dB lower than the speech, it will not
seriously affect the speech waveform.
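
The SNR measurement and gain rule can be sketched as follows. The integrals of Equations 6 and 7 are replaced by sums over FFT bins, the threshold `snr0 = 2.0` is used as an approximation to 3 dB expressed as a power ratio, and the small `floor` that keeps the logarithm defined when the measured SNR is zero or negative is an assumption of this sketch. The function names are hypothetical.

```python
def snr_from_envelopes(speech_plus_noise, noise):
    """Equations 6 to 8, with the integrals replaced by sums over bins."""
    e_n = sum(noise)                 # average background noise energy
    e_y = sum(speech_plus_noise)     # energy of the speech-plus-noise envelope
    return (e_y - e_n) / e_n


def suppression_gain(snr, snr0=2.0, slope=1.0, floor=1e-3):
    """Gain of Equation 10.

    Below snr0, log G = slope * (log SNR - log SNR0), that is
    G = (SNR / SNR0) ** slope. At or above snr0 the gain is 1.
    """
    if snr >= snr0:
        return 1.0
    # Equation 10 is undefined for SNR <= 0; clamping is an assumption.
    return (max(snr, floor) / snr0) ** slope
```

The gain from `suppression_gain` would multiply the harmonic amplitude entries before they are added to the synthetic noise buffer, following Equation 9.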



This enhancement system was incorporated 
into the real-time program and was found to 
dramatically improve the quality of the synthesized 
noisy speech. After a short adaption time (about 1 
sec), the tonal noise was essentially eliminated, 
having been replaced by colored noise that was truly 
"noise-like." 

At low data rates (about 2.4 kbps), it is not 
possible to code any of the sine-wave phases. 
Techniques have been developed to reconstruct an 
appropriate set of phases for use in synthesis, based 
on the idea that all of the sine waves should come 
into phase every pitch-onset time. (See U.S. Serial 
No. 034,097 for further details.) It was shown that 
this property could be achieved by defining a phase 
function for the pitch fundamental that was obtained 
by integrating the instantaneous pitch frequency, 
which in turn was defined to be the linear 
interpolation between the matched fundamental 
frequencies at frame K and frame K+1. This means 
that the phase track would be quadratic over the 
synthesis frame, a condition that was easily realized 
in the sample-based approach to sine-wave synthesis 
using Equation (1). 

With the FFT/overlap-add synthesizer, 
however, the phase variation can, at most, be 
piecewise linear. Therefore, rather than use the 
quadratic phase model to produce an endpoint phase 
and then produce a midpoint phase for the 
FFT/overlap-add method using Equation (4), it is 
preferable to introduce a new phase track for the 
fundamental frequency which is simply the integral of 
the piecewise constant frequencies. 

The onset times for the mid-point sine waves 
and for the frame K+1 sine waves (denoted by 
$\bar{n}_0$ and $n_0^{K+1}$) can be found by locating the times 
at which this phase function crosses the nearest 
multiple of $2\pi$. The sine-wave phases at each 
frequency $\omega$ can then be determined using the linear 
phase models: 

$$\bar{\theta}(\omega) = -\bar{n}_0\, \omega$$

$$\theta^{K+1}(\omega) = -n_0^{K+1}\, \omega$$
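
By way of illustration only, the phase reconstruction described above may be sketched in Python as follows. The sketch assumes that frequencies are expressed in radians per sample, that the onset time is measured as a signed offset from the point at which the phase is evaluated, and that the linear phase is negative in the onset time, so that every sine wave has zero phase at the onset. The function names and the segment representation are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def integrate_phase(phi_start, segments):
    """Integrate piecewise-constant frequencies (rad/sample).

    segments is a list of (omega, num_samples) pairs. The resulting phase
    track is piecewise linear; its value at the end of the last segment
    is returned.
    """
    phi = phi_start
    for omega, num_samples in segments:
        phi += omega * num_samples
    return phi


def onset_offset(phi, omega0):
    """Signed offset, in samples, from the evaluation point to the time at
    which the fundamental phase equals the nearest multiple of 2*pi.

    Assumes the fundamental frequency omega0 is constant near that point.
    """
    nearest = TWO_PI * round(phi / TWO_PI)
    return (nearest - phi) / omega0


def linear_phase(n0, omega):
    """Linear phase model: the phase for which cos(omega*n + phase)
    has zero argument at n = n0."""
    return -n0 * omega


if __name__ == "__main__":
    w0 = TWO_PI / 40.0  # hypothetical pitch period of 40 samples
    phi = integrate_phase(0.0, [(w0, 30), (w0, 20)])
    n0 = onset_offset(phi, w0)
    print(n0, [linear_phase(n0, h * w0) for h in range(1, 4)])
```

With these conventions, each harmonic of the fundamental is assigned a phase that brings all of the sine waves into phase at the pitch-onset time.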

It will be understood that changes may be 
made in the above construction and in the foregoing 
sequences of operation without departing from the 
scope of the invention. It is, accordingly, intended 
that all matter contained in the above description or 
shown in the accompanying drawings be interpreted as 
illustrative rather than in a limiting sense. 

It is also understood that the following 
claims are intended to cover all of the generic and 
specific features of the invention as described 
herein, and all statements of the scope of the 
invention which, as a matter of language, might be 
said to fall therebetween. 

Having described the invention, what is 
claimed as new and secured by Letters Patent is: 

CLAIMS 

1. A method of processing an acoustic 
waveform, the method comprising: 

sampling a waveform to obtain a series of 
discrete samples and constructing therefrom a series 
of frames, each frame spanning a plurality of samples; 

analyzing each frame of samples to extract a 
set of frequency components having individual 
amplitudes; 

tracking said components from one frame to a 
next frame, said tracking including matching a 
component from the one frame with a component in the 
next frame having a similar value; and 

interpolating the values of the components 
from the one frame to the next frame by performing an 
overlap-and-add function utilizing Fourier analysis 
to generate a reconstruction of said waveform. 

2. The method of claim 1 wherein said 
interpolating step further includes estimating 
mid-frame values and interpolating between said 
mid-frame values and values obtained during each 
frame in order to generate a refined representation 
of the waveform. 

3. The method of claim 2 wherein said 
estimating step further includes deriving mid-frame 
amplitude and frequency values by linear 
interpolation of lagging and leading sine waves. 

4. The method of claim 2 wherein said 
estimating step further includes providing a 
mid-frame phase value such that the sine wave 
corresponding to the interpolated mid-frame values of 
the parametric representation is best fit to 
predetermined segments of lagging and leading sine 
waves. 

5. The method of claim 2 wherein said 
estimating step further includes deriving mid-frame 
phase values from the lagging and leading sine waves 
according to the following equation: 

$$\theta(M) = \frac{\theta_0 + \theta_1}{2} + \frac{\omega_0 - \omega_1}{2} \cdot \frac{N}{4} + \pi M$$

where M is an integer whose value is chosen such 
that $\pi M$ is closest to 

$$\frac{\theta_0 - \theta_1}{2} + \frac{\omega_0 + \omega_1}{2} \cdot \frac{N}{4}$$

and where $\theta_0$ is the phase of the lagging frame, 
$\theta_1$ is the phase of the leading frame, $\omega_0$ is 
the frequency of the lagging frame, $\omega_1$ is the 
frequency of the leading frame, and N is the analysis 
frame length. 

6. The method of claim 1 wherein the method 
further includes suppressing tonal noise values. 




7. The method of claim 6 wherein the method 
further includes estimating a noise envelope and 
using said noise envelope estimate to drive a noise 
suppression filter. 

8. The method of claim 6 wherein the method 
further includes generating broadband noise to 
replace said suppressed noise values. 

9. The method of suppressing tonal noise 
artifacts during the reconstruction of an acoustic 
waveform from a sinusoidal parametric representation 
of the waveform, the method comprising: 

estimating a noise envelope from a set of 
frequency components having individual amplitudes 
which comprise a parametric representation of the 
waveform; 

reconstructing an acoustic waveform from 
said parametric representation; and 

filtering said reconstructed waveform using 
said noise envelope estimates to suppress tonal noise 
estimates. 

10. A method of deriving phase values for 
frequency components during reconstruction of an 
acoustic waveform from a sinusoidal representation of 
the waveform, the method comprising: 

determining a phase of the fundamental 
frequency by integration of a pitch frequency 
obtained by linear interpolation of matched 
fundamental frequencies between successive frames; 

determining a pitch onset time by locating 
the time at which the phase function crosses the 
nearest multiple of $2\pi$; and 

allocating phase values to the frequency 
components, such that all of the frequency components 
come into phase every pitch onset time. 

11. A system for processing an acoustic 
waveform, the system comprising: 

sampling means for sampling a waveform to 
obtain a series of discrete samples and constructing 
therefrom a series of frames, each frame spanning a 
plurality of samples, 

analyzing means for analyzing each frame of 
samples to extract a set of frequency components 
having individual amplitudes, 

tracking means for tracking said components 
from one frame to a next frame, said tracking means 
including matching means for matching a component 
from the one frame with a component in the next frame 
having a similar value, 

interpolating means for interpolating the 
values of the components from the one frame to the 
next frame, including means for performing an 
overlap-and-add function utilizing Fourier analysis 
to generate a reconstruction of said waveform. 

12. The system of claim 11 wherein said 
interpolating means further includes mid-frame 
estimating means for estimating mid-frame values and 
means for interpolating between said mid-frame 
values and values obtained during each frame in order 
to generate a refined representation of the waveform. 

14. The system of claim 12 wherein said 
mid-frame estimating means further includes means for 
linearly interpolating the amplitude and frequency 
values of the lagging and leading sine waves to 
obtain mid-frame values. 

15. The system of claim 12 wherein said 
mid-frame estimating means further includes means for 
deriving mid-frame phase values such that sine waves 
corresponding to the interpolated mid-frame values of 
the parametric representation is best fit to 
predetermined segments of lagging and leading sine 
waves. 

16. The system of claim 12 wherein said 
mid-frame estimating means further includes means for 
deriving mid-frame phase values from lagging and 
leading sine waves according to the following 
equation: 

$$\theta(M) = \frac{\theta_0 + \theta_1}{2} + \frac{\omega_0 - \omega_1}{2} \cdot \frac{N}{4} + \pi M$$

where M is an integer whose value is chosen such 
that $\pi M$ is closest to 

$$\frac{\theta_0 - \theta_1}{2} + \frac{\omega_0 + \omega_1}{2} \cdot \frac{N}{4}$$

and where $\theta_0$ is the phase of the lagging frame, 
$\theta_1$ is the phase of the leading frame, $\omega_0$ is 
the frequency of the lagging frame, $\omega_1$ is the 
frequency of the leading frame, and N is the analysis 
frame length. 

17. The system of claim 11 wherein said 
system further includes means for suppressing tonal 
values. 

18. The system of claim 17 wherein said 
system further includes noise estimating means for 
estimating a noise envelope and a filter means for 
suppressing tonal noise values in response to said 
noise envelope estimate. 

19. The system of claim 17 wherein said 
system further includes a broadband noise generator 
to replace said suppressed noise values with 
broadband noise. 

20. A receiver for receiving a coded 
parametric representation of an acoustic waveform in 
which the representation comprises a set of 
frequency components having individual amplitudes 
defining sine waves which can be summed to recreate 
the waveform at a particular frame of time, the 
receiver comprising: 

decoding means for extracting a set of 
frequency components having individual amplitudes 
from each frame of a coded representation of an 
acoustic waveform; 

tracking means for tracking said components 
from one frame to a next frame, said tracking means 
including matching means for matching a component 
from the one frame with a component in the next frame 
having a similar value; and 

interpolation means for interpolating the 
values of the components from the one frame to the 
next frame, including means for performing an 
overlap-and-add function utilizing Fourier analysis, 
to generate a reconstruction of said waveform. 

21. The receiver of claim 20 wherein said 
interpolating means further includes mid-frame 
estimating means for estimating mid-frame values and 
means for interpolating between said mid-frame 
values and values obtained during each frame in order 
to generate a refined representation of the waveform. 

22. The receiver of claim 21 wherein said 
mid-frame estimating means further includes means for 
linearly interpolating the amplitude and frequency 
values of the lagging and leading sine waves to 
obtain mid-frame values. 

23. The receiver of claim 21 wherein said 
mid-frame estimating means further includes means for 
deriving mid-frame phase values such that sine waves 
corresponding to the interpolated mid-frame values of 
the parametric representation is best fit to 
predetermined segments of lagging and leading sine 
waves. 

24. The receiver of claim 21 wherein said 
mid-frame estimating means further includes means for 
deriving mid-frame phase values from lagging and 
leading sine waves according to the following 
equation: 

$$\theta(M) = \frac{\theta_0 + \theta_1}{2} + \frac{\omega_0 - \omega_1}{2} \cdot \frac{N}{4} + \pi M$$

where M is an integer whose value is chosen such 
that $\pi M$ is closest to 

$$\frac{\theta_0 - \theta_1}{2} + \frac{\omega_0 + \omega_1}{2} \cdot \frac{N}{4}$$

and where $\theta_0$ is the phase of the lagging frame, 
$\theta_1$ is the phase of the leading frame, $\omega_0$ is 
the frequency of the lagging frame, $\omega_1$ is the 
frequency of the leading frame, and N is the analysis 
frame length. 

25. The receiver of claim 20 wherein said 
system further includes means for suppressing tonal 
values. 

26. The receiver of claim 25 wherein said 
system further includes noise estimating means for 
estimating a noise envelope and a filter means for 
suppressing tonal noise values in response to said 
noise envelope estimate. 

27. The receiver of claim 25 wherein said 
system further includes a broadband noise generator 
to replace said suppressed noise values with 
broadband noise. 
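
For illustration only, and without limiting the claims, the mid-frame phase rule recited in claims 5, 16 and 24 may be sketched in Python as follows. The sketch assumes phases in radians, frequencies in radians per sample, and the nearest integer M obtained by rounding; the function name is hypothetical. The usage example further assumes that the lagging and leading frames are N/2 samples apart, which is an inference from the N/4 factor in the recited equation.

```python
import math


def midframe_phase(theta0, theta1, omega0, omega1, frame_len):
    """Return (theta_mid, M) for the recited mid-frame phase rule.

    theta0, omega0: phase and frequency of the lagging frame.
    theta1, omega1: phase and frequency of the leading frame.
    frame_len:      analysis frame length N, in samples.
    """
    quarter = frame_len / 4.0
    base = (theta0 + theta1) / 2.0 + (omega0 - omega1) / 2.0 * quarter
    target = (theta0 - theta1) / 2.0 + (omega0 + omega1) / 2.0 * quarter
    m = round(target / math.pi)  # integer M for which pi*M is closest to target
    return base + math.pi * m, m


if __name__ == "__main__":
    # Stationary sinusoid at 0.1 rad/sample, frames assumed N/2 = 20 samples apart.
    omega, n = 0.1, 40
    theta_lag = 0.0
    theta_lead = omega * n / 2.0 - 2.0 * math.pi  # leading phase, wrapped by 2*pi
    print(midframe_phase(theta_lag, theta_lead, omega, omega, n))
```

In this example the term $\pi M$ compensates for the $2\pi$ wrap in the leading-frame phase, so the estimate agrees with the phase the sinusoid actually has a quarter frame after the lagging frame.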